Zeynal Mardanli: Implemented optimized matrix multiplication by Lshiroc · Pull Request #5 · AA-parallel-computing/Assignment-4-Optional

Lshiroc · 2026-05-31T19:17:09Z

Optimizations:

Blocked matmul: Split the matrices into 32×32 tiles so each tile is reused while it is still in cache instead of being re-fetched from RAM. Also reordered the inner loops to i-k-j so the innermost loop reads B and writes C sequentially.
Parallel matmul: Added #pragma omp parallel for over the row loop. Each row of C is written by exactly one thread, so there are no race conditions. Rows are split evenly across threads with schedule(static).
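
For readers without the diff open, the two optimizations could look roughly like the sketch below. This is an illustration under assumptions, not the code from the commit: the function names `matmul_blocked` and `matmul_parallel`, the `double` element type, the row-major layout, and the `(m, n, p)` argument order are all hypothetical. Only the 32×32 tile size, the i-k-j order, and the `schedule(static)` pragma come from the description above.

```c
#include <assert.h>

#define TILE 32  /* tile edge taken from the PR description */

static int min_int(int a, int b) { return a < b ? a : b; }

/* C (m x p) = A (m x n) * B (n x p), row-major. Hypothetical signature. */
void matmul_blocked(const double *A, const double *B, double *C,
                    int m, int n, int p)
{
    assert(A && B && C);

    /* C is accumulated tile by tile, so it has to start at zero. */
    for (int idx = 0; idx < m * p; ++idx)
        C[idx] = 0.0;

    for (int ii = 0; ii < m; ii += TILE) {
        int i_end = min_int(ii + TILE, m);
        for (int kk = 0; kk < n; kk += TILE) {
            int k_end = min_int(kk + TILE, n);
            for (int jj = 0; jj < p; jj += TILE) {
                int j_end = min_int(jj + TILE, p);
                /* i-k-j inside the tile: the j loop walks B and C
                   along a row, i.e. with unit stride. */
                for (int i = ii; i < i_end; ++i) {
                    for (int k = kk; k < k_end; ++k) {
                        double a = A[i * n + k];
                        for (int j = jj; j < j_end; ++j)
                            C[i * p + j] += a * B[k * p + j];
                    }
                }
            }
        }
    }
}

/* Row-parallel version: each iteration of the i loop touches only row i
   of C, so threads never write to the same element. */
void matmul_parallel(const double *A, const double *B, double *C,
                     int m, int n, int p)
{
    assert(A && B && C);

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < p; ++j)
            C[i * p + j] = 0.0;
        for (int k = 0; k < n; ++k) {
            double a = A[i * n + k];
            for (int j = 0; j < p; ++j)
                C[i * p + j] += a * B[k * p + j];
        }
    }
}
```

With GCC or Clang the pragma only takes effect when compiling with `-fopenmp`; without that flag it is ignored and `matmul_parallel` runs sequentially with the same result.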

Challenges:

Test matrices are small (under 300×300), so they already fit in cache, which is why the blocked version showed little to no speedup in most cases.
For very small matrices (e.g. 32×128×32), OpenMP thread startup overhead was larger than the actual computation, giving speedup below 1×.
The loop order in the assignment pseudocode (i-j-k) causes strided memory access inside a block, so I switched to i-k-j to fix that.
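
The loop-order point can be seen in a minimal, unblocked sketch. The function names `matmul_ijk` and `matmul_ikj`, the `double` element type, and the row-major layout are assumptions made for illustration; the assignment's actual pseudocode is not reproduced here. Both functions compute the same product and differ only in which index the innermost loop advances.

```c
#include <assert.h>

/* i-j-k order: the innermost loop advances k, so consecutive reads of
   B[k * p + j] are p elements apart (strided access in row-major). */
void matmul_ijk(const double *A, const double *B, double *C,
                int m, int n, int p)
{
    assert(A && B && C);
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < p; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * p + j];
            C[i * p + j] = sum;
        }
    }
}

/* i-k-j order: the innermost loop advances j, so B[k * p + j] and
   C[i * p + j] are both read and written at consecutive addresses. */
void matmul_ikj(const double *A, const double *B, double *C,
                int m, int n, int p)
{
    assert(A && B && C);
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < p; ++j)
            C[i * p + j] = 0.0;
        for (int k = 0; k < n; ++k) {
            double a = A[i * n + k];  /* loaded once per (i, k) pair */
            for (int j = 0; j < p; ++j)
                C[i * p + j] += a * B[k * p + j];
        }
    }
}
```

The i-j-k version is still handy as a correctness baseline: its output can be compared element by element against an optimized variant.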

Zeynal Mardanli: Implemented optimized matrix multiplication

9f427a8

Lshiroc wants to merge 1 commit into AA-parallel-computing:main from Lshiroc:zeynal-mardanli